In machine learning, support vector machines (SVMs, also support vector networks[1]) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples to one category or the other, making it a non-probabilistic binary linear classifier. An SVM model is a representation of the examples as points in space, mapped so that the examples of the separate categories are divided by a clear gap that is as wide as possible. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gap they fall on.
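As a minimal sketch of the classification idea described above, the snippet below fits a linear SVM to a tiny two-class dataset and classifies two new points by the side of the boundary they fall on. The toy points and the variable names are made up for illustration; only `SVC`, `fit`, `predict` and `decision_function` come from scikit-learn.

```python
import numpy as np
from sklearn.svm import SVC

# Toy data (made up for illustration): two well-separated clusters in 2D.
X_toy = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 0.0],
                  [4.0, 4.0], [4.5, 3.5], [5.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel gives a straight separating boundary (a line in 2D).
clf = SVC(kernel="linear")
clf.fit(X_toy, y_toy)

# New examples are classified by the side of the boundary they fall on.
new_points = np.array([[0.2, 0.3], [4.8, 4.2]])
labels = clf.predict(new_points)
print(labels)

# The sign of the decision function tells which side a point lies on:
# negative for the first class (0), positive for the second class (1).
sides = clf.decision_function(new_points)
print(sides)
```

The first new point sits next to the class 0 cluster and the second next to the class 1 cluster, so they land on opposite sides of the gap.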
In [ ]:
# SVM Regression
import numpy as np
from sklearn import datasets
from sklearn.svm import SVR
import pandas as pd
In [ ]:
# load the diabetes dataset
# for info on this dataset, refer to the linear_regression script
dataset = datasets.load_diabetes()
In [ ]:
# Let us now build a pandas dataframe hosting the data at hand
# We first need the list of feature names for our columns
# BMI is the Body Mass Index
# ABP is the Average Blood Pressure
lfeat = ["Age", "Sex", "BMI", "ABP", "S1", "S2", "S3", "S4", "S5", "S6"]
In [ ]:
# We now build the Dataframe, with the data as argument
# and the list of column names as keyword argument
df_diabetes = pd.DataFrame(dataset.data, columns = lfeat)
In [ ]:
# We also want to add the regression target
# Let's create a new column :
df_diabetes["Target"] = dataset.target # Must have the correct size of course
In [ ]:
# Let's have a look at the first few entries
print("Printing data up to the 5th sample")
print(df_diabetes.iloc[:5, :])  # Look at the first 5 samples for all features.
In [ ]:
# We are now going to fit an SVR model to the data
# Please have a look at svm_classification.py first
# SVR is basically an adaptation of the SVM method to regression,
# where we modify the constraints of the optimisation problem
# so that the prediction does not deviate from the target
# by more than a specified threshold
# As before, we create an instance of the model
model = SVR()
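The "specified threshold" mentioned in the comments above corresponds to the `epsilon` parameter of scikit-learn's `SVR`. A small sketch that prints the settings used when `SVR()` is called with no arguments (the name `default_model` is only for illustration):

```python
from sklearn.svm import SVR

default_model = SVR()
params = default_model.get_params()

# kernel:  shape of the fitted function ("rbf" by default, so not a straight line)
# C:       penalty applied to points that fall outside the threshold
# epsilon: the threshold itself; errors smaller than this are ignored
for name in ("kernel", "C", "epsilon"):
    print(name, "=", params[name])
```

Any of these can be passed as keyword arguments, for example `SVR(kernel="linear", epsilon=0.5)`.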
In [ ]:
# Which we then fit to the training data X, Y
# with pandas we have to split the df in two :
# the feature part (X) and the target part (Y)
# This is done below :
data = df_diabetes[lfeat].values
target = df_diabetes["Target"].values
model.fit(data, target)
print(model)
In [ ]:
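# as before, we can use the model to make predictions on any data
predicted = model.predict(data)
# and evaluate the performance of the regression with standard metrics:
# the mean squared error between the predictions and the true targets
mse = np.mean((predicted - target)**2)
print(mse)
# and the coefficient of determination (R^2) returned by score
print(model.score(data, target))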
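The evaluation above uses the same samples the model was fitted to, which tends to give an optimistic picture. A hedged sketch of the same metrics on held-out data follows; the 25% test fraction, the `random_state` and the variable names are arbitrary choices for illustration, not part of the original script.

```python
from sklearn import datasets
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

dataset = datasets.load_diabetes()

# Keep a quarter of the samples aside; the model never sees them during fit.
X_train, X_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=0)

held_out_model = SVR()
held_out_model.fit(X_train, y_train)

# Same two metrics as before, now computed on unseen samples only.
test_pred = held_out_model.predict(X_test)
test_mse = mean_squared_error(y_test, test_pred)
print(test_mse)
print(held_out_model.score(X_test, y_test))
```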
Support vector regression (SVR) also finds the coefficients of a model; the difference lies in how it finds them.
SVR can fit a straight line to the data, and it does so with a geometric interpretation: it looks for the line such that all points lie at a distance D or less from it. See the figure below.
D is specified when the model is constructed (in scikit-learn's SVR it is the epsilon parameter) and can be chosen so as to minimise the prediction error.
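To make the role of D concrete, here is a minimal sketch that fits a straight line with a linear-kernel `SVR` for a few values of `epsilon` and reports the fraction of points lying within that distance of the fitted line. The synthetic data, the value `C=100.0` and all variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic data (made up for illustration): a noisy line y = 2x + 1.
rng = np.random.RandomState(0)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.shape)
X = x.reshape(-1, 1)  # scikit-learn expects a 2D feature array

results = []
for eps in (0.1, 0.5, 1.0):
    # kernel="linear" makes SVR fit a straight line; epsilon plays the role of D.
    svr = SVR(kernel="linear", C=100.0, epsilon=eps)
    svr.fit(X, y)
    residuals = np.abs(y - svr.predict(X))
    inside = np.mean(residuals <= eps)  # fraction of points within distance eps
    slope = svr.coef_[0, 0]             # coef_ is available for linear kernels
    results.append((eps, slope, inside))
    print("epsilon=%.1f  slope=%.2f  fraction inside=%.2f" % (eps, slope, inside))
```

In practice not every point ends up within the chosen distance: points outside it are allowed but penalised, with `C` controlling how strongly.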
In [2]:
from IPython.display import Image
Image('figures/svm_regression.png')